Fundamental Techniques in Data Science with R
Everything for today (and more) can be found at
Data manipulation
for-loops, if statements
custom functions
Basic analysis (correlation & t-test)
Pipes
Data visualization with ggplot2
We assign elements to objects in R like this:
a <- 100 b <- a^2
We can combine elements into multiple types of objects:
Vectors, matrices and arrays can be either numeric or character. Objects that can combine numeric and character information are:
c(1, 3, 5, 2, 6, 4)
## [1] 1 3 5 2 6 4
v <- 1:6 v
## [1] 1 2 3 4 5 6
v[4]
## [1] 4
m <- matrix(1:6, nrow = 3, ncol = 2) m
## [,1] [,2] ## [1,] 1 4 ## [2,] 2 5 ## [3,] 3 6
m[1, ]
## [1] 1 4
m[, 2]
## [1] 4 5 6
d <- data.frame(numbers = 1:3,
letters = c("a", "b", "c"),
V3 = c(TRUE, FALSE, TRUE))
d
## numbers letters V3 ## 1 1 a TRUE ## 2 2 b FALSE ## 3 3 c TRUE
d$numbers
## [1] 1 2 3
d$letters
## [1] a b c ## Levels: a b c
l <- list(vector = v, dframe = d) l
## $vector ## [1] 1 2 3 4 5 6 ## ## $dframe ## numbers letters V3 ## 1 1 a TRUE ## 2 2 b FALSE ## 3 3 c TRUE
l$vector
## [1] 1 2 3 4 5 6
l$dframe$numbers
## [1] 1 2 3
#) to clarify what you are doing
R-scripts
RStudio projectsFunctions have parentheses (). Names directly followed by parentheses always indicate functions. For example;
matrix() is a functionc() is a function(1 - 2) * 5 is a calculation, not a functionPackages give additional functionality to R.
By default, some packages are included. These packages allow you to do mainstream statistical analyses and data manipulation. Installing additional packages allow you to perform the state of the art in statistical programming and estimation.
The cool thing is that these packages are all developed by users. The throughput process is therefore very timely:
A list of available packages can be found on CRAN
Functions have parentheses (). Names directly followed by parentheses always indicate functions. For example;
matrix() is a functionc() is a function(1 - 2) * 5 is a calculation, not a functionPackages extend the basic functionality of R.
There are two ways to load a package in R
library(stats)
and
require(stats)
The easiest way to install e.g. package mice is to use
install.packages("mice")
Alternatively, you can also do it in RStudio through
Tools --> Install Packages
Use common sense and BE CONSISTENT.
Browse through the tidyverse style guide
If code you add to a file looks drastically different from the existing code around it, the discontinuity will throw readers and collaborators out of their rhythm when they go to read it. Try to avoid this.
Intentional spacing makes your code easier to interpret
a<-c(1,2,3,4,5) vs;a <- c(1, 2, 3, 4, 5)at least put a space after every comma!
library(MASS) # for the cats data library(dplyr) # data manipulation library(haven) # in/exporting data library(magrittr) # pipes library(mice) # for the boys data library(ggplot2) # visualization
transform(): changing and adding columnsdplyr::filter(): row-wise selection (of cases)table(): frequency tablesclass(): object classlevels(): levels of a factorhaven::read_sav(): import SPSS datacor(): bivariate correlationsample(): drawing a samplet.test(): t-testMASS::cats datahead(cats)
## Sex Bwt Hwt ## 1 F 2.0 7.0 ## 2 F 2.0 7.4 ## 3 F 2.0 9.5 ## 4 F 2.1 7.2 ## 5 F 2.1 7.3 ## 6 F 2.1 7.6
str(cats)
## 'data.frame': 144 obs. of 3 variables: ## $ Sex: Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ... ## $ Bwt: num 2 2 2 2.1 2.1 2.1 2.1 2.1 2.1 2.1 ... ## $ Hwt: num 7 7.4 9.5 7.2 7.3 7.6 8.1 8.2 8.3 8.5 ...
fem.cats <- cats[cats$Sex == "F", ] dim(fem.cats)
## [1] 47 3
head(fem.cats)
## Sex Bwt Hwt ## 1 F 2.0 7.0 ## 2 F 2.0 7.4 ## 3 F 2.0 9.5 ## 4 F 2.1 7.2 ## 5 F 2.1 7.3 ## 6 F 2.1 7.6
heavy.cats <- cats[cats$Bwt > 3, ] dim(heavy.cats)
## [1] 36 3
head(heavy.cats)
## Sex Bwt Hwt ## 109 M 3.1 9.9 ## 110 M 3.1 11.5 ## 111 M 3.1 12.1 ## 112 M 3.1 12.5 ## 113 M 3.1 13.0 ## 114 M 3.1 14.3
heavy.cats <- subset(cats, Bwt > 3) dim(heavy.cats)
## [1] 36 3
head(heavy.cats)
## Sex Bwt Hwt ## 109 M 3.1 9.9 ## 110 M 3.1 11.5 ## 111 M 3.1 12.1 ## 112 M 3.1 12.5 ## 113 M 3.1 13.0 ## 114 M 3.1 14.3
dplyrfilter(cats, Bwt > 2, Bwt < 2.2, Sex == "F")
## Sex Bwt Hwt ## 1 F 2.1 7.2 ## 2 F 2.1 7.3 ## 3 F 2.1 7.6 ## 4 F 2.1 8.1 ## 5 F 2.1 8.2 ## 6 F 2.1 8.3 ## 7 F 2.1 8.5 ## 8 F 2.1 8.7 ## 9 F 2.1 9.8
class(cats$Sex)
## [1] "factor"
levels(cats$Sex)
## [1] "F" "M"
sort1 <- arrange(cats, Bwt) head(sort1)
## Sex Bwt Hwt ## 1 F 2.0 7.0 ## 2 F 2.0 7.4 ## 3 F 2.0 9.5 ## 4 M 2.0 6.5 ## 5 M 2.0 6.5 ## 6 F 2.1 7.2
sort2 <- arrange(cats, desc(Bwt)) head(sort2)
## Sex Bwt Hwt ## 1 M 3.9 14.4 ## 2 M 3.9 20.5 ## 3 M 3.8 14.8 ## 4 M 3.8 16.8 ## 5 M 3.7 11.0 ## 6 M 3.6 11.8
cor(cats[, -1])
## Bwt Hwt ## Bwt 1.0000000 0.8041274 ## Hwt 0.8041274 1.0000000
With [, -1] we exclude the first column
cor.test(cats$Bwt, cats$Hwt)
## ## Pearson's product-moment correlation ## ## data: cats$Bwt and cats$Hwt ## t = 16.119, df = 142, p-value < 2.2e-16 ## alternative hypothesis: true correlation is not equal to 0 ## 95 percent confidence interval: ## 0.7375682 0.8552122 ## sample estimates: ## cor ## 0.8041274
What do we conclude?
plot(cats$Bwt, cats$Hwt)
Test the null hypothesis that the difference in mean heart weight between male and female cats is 0
t.test(formula = Hwt ~ Sex, data = cats)
## ## Welch Two Sample t-test ## ## data: Hwt by Sex ## t = -6.5179, df = 140.61, p-value = 1.186e-09 ## alternative hypothesis: true difference in means is not equal to 0 ## 95 percent confidence interval: ## -2.763753 -1.477352 ## sample estimates: ## mean in group F mean in group M ## 9.202128 11.322680
plot(formula = Hwt ~ Sex, data = cats)
If-statements
For-loops
Apply()
Writing your own functions
Often, we want to run some code only if some condition is true.
For example:
a <- 2 a > 5
## [1] FALSE
if (a > 5){
print("a is larger than 5.")
}
a <- 8
if (a > 5){
print("a is larger than 5.")
}
## [1] "a is larger than 5."
We can also specify something to be run if the condition is not true.
a <- 2
if (a > 5){
print("a is larger than 5.")
} else {
print("a is smaller than 5.")
}
## [1] "a is smaller than 5."
a <- 8
if (a > 5){
print("a is larger than 5.")
} else {
print("a is smaller than 5.")
}
## [1] "a is larger than 5."
For loops are used when we want to perform some repetitive calculations.
It is often tedious, or even impossible, to write this repetition out completely.
# Let's print the numbers 1 to 6 one by one. print(1)
## [1] 1
print(2)
## [1] 2
print(3)
## [1] 3
print(4)
## [1] 4
print(5)
## [1] 5
print(6)
## [1] 6
For-loops allow us to automate this!
for (i in 1:6){
print(i)
}
## [1] 1 ## [1] 2 ## [1] 3 ## [1] 4 ## [1] 5 ## [1] 6
for (i in 1:6){
print(i < 5)
}
## [1] TRUE ## [1] TRUE ## [1] TRUE ## [1] TRUE ## [1] FALSE ## [1] FALSE
for (i in 1:nrow(cats)){
if (cats$Bwt[i] > 2.5){
cat(i, "is over 2.5. It is:", cats$Bwt[i], "\n")
}
}
## 37 is over 2.5. It is: 2.6 ## 38 is over 2.5. It is: 2.6 ## 39 is over 2.5. It is: 2.6 ## 40 is over 2.5. It is: 2.7 ## 41 is over 2.5. It is: 2.7 ## 42 is over 2.5. It is: 2.7 ## 43 is over 2.5. It is: 2.9 ## 44 is over 2.5. It is: 2.9 ## 45 is over 2.5. It is: 2.9 ## 46 is over 2.5. It is: 3 ## 47 is over 2.5. It is: 3 ## 73 is over 2.5. It is: 2.6 ## 74 is over 2.5. It is: 2.6 ## 75 is over 2.5. It is: 2.6 ## 76 is over 2.5. It is: 2.6 ## 77 is over 2.5. It is: 2.6 ## 78 is over 2.5. It is: 2.6 ## 79 is over 2.5. It is: 2.7 ## 80 is over 2.5. It is: 2.7 ## 81 is over 2.5. It is: 2.7 ## 82 is over 2.5. It is: 2.7 ## 83 is over 2.5. It is: 2.7 ## 84 is over 2.5. It is: 2.7 ## 85 is over 2.5. It is: 2.7 ## 86 is over 2.5. It is: 2.7 ## 87 is over 2.5. It is: 2.7 ## 88 is over 2.5. It is: 2.8 ## 89 is over 2.5. It is: 2.8 ## 90 is over 2.5. It is: 2.8 ## 91 is over 2.5. It is: 2.8 ## 92 is over 2.5. It is: 2.8 ## 93 is over 2.5. It is: 2.8 ## 94 is over 2.5. It is: 2.8 ## 95 is over 2.5. It is: 2.9 ## 96 is over 2.5. It is: 2.9 ## 97 is over 2.5. It is: 2.9 ## 98 is over 2.5. It is: 2.9 ## 99 is over 2.5. It is: 2.9 ## 100 is over 2.5. It is: 3 ## 101 is over 2.5. It is: 3 ## 102 is over 2.5. It is: 3 ## 103 is over 2.5. It is: 3 ## 104 is over 2.5. It is: 3 ## 105 is over 2.5. It is: 3 ## 106 is over 2.5. It is: 3 ## 107 is over 2.5. It is: 3 ## 108 is over 2.5. It is: 3 ## 109 is over 2.5. It is: 3.1 ## 110 is over 2.5. It is: 3.1 ## 111 is over 2.5. It is: 3.1 ## 112 is over 2.5. It is: 3.1 ## 113 is over 2.5. It is: 3.1 ## 114 is over 2.5. It is: 3.1 ## 115 is over 2.5. It is: 3.2 ## 116 is over 2.5. It is: 3.2 ## 117 is over 2.5. It is: 3.2 ## 118 is over 2.5. It is: 3.2 ## 119 is over 2.5. It is: 3.2 ## 120 is over 2.5. It is: 3.2 ## 121 is over 2.5. It is: 3.3 ## 122 is over 2.5. It is: 3.3 ## 123 is over 2.5. It is: 3.3 ## 124 is over 2.5. It is: 3.3 ## 125 is over 2.5. It is: 3.3 ## 126 is over 2.5. It is: 3.4 ## 127 is over 2.5. It is: 3.4 ## 128 is over 2.5. It is: 3.4 ## 129 is over 2.5. It is: 3.4 ## 130 is over 2.5. It is: 3.4 ## 131 is over 2.5. It is: 3.5 ## 132 is over 2.5. It is: 3.5 ## 133 is over 2.5. It is: 3.5 ## 134 is over 2.5. It is: 3.5 ## 135 is over 2.5. It is: 3.5 ## 136 is over 2.5. It is: 3.6 ## 137 is over 2.5. It is: 3.6 ## 138 is over 2.5. It is: 3.6 ## 139 is over 2.5. It is: 3.6 ## 140 is over 2.5. It is: 3.7 ## 141 is over 2.5. It is: 3.8 ## 142 is over 2.5. It is: 3.8 ## 143 is over 2.5. It is: 3.9 ## 144 is over 2.5. It is: 3.9
apply() familyapply()The apply family is a group of very useful functions that allow you to easily execute a function of your choice over a list of objects, such as a list, a data.frame, or matrix.
We will look at three examples:
apply
sapply
lapply
apply()apply is used for matrices (and sometimes dataframes). It can take a function that takes a vector as input, and apply it to each row or column.
apply()MARGIN is 1 for rows, 2 for columns.
apply(cats[, -1], MARGIN = 2, mean)
## Bwt Hwt ## 2.723611 10.630556
But we’ve seen this done easier:
colMeans(cats[, -1])
## Bwt Hwt ## 2.723611 10.630556
However, the power of apply() is that it can use any function we throw at it.
apply()set.seed(123) rand.mat <- matrix(rnorm(21), nrow = 3, ncol = 7) rand.mat
## [,1] [,2] [,3] [,4] [,5] [,6] ## [1,] -0.5604756 0.07050839 0.4609162 -0.4456620 0.4007715 1.7869131 ## [2,] -0.2301775 0.12928774 -1.2650612 1.2240818 0.1106827 0.4978505 ## [3,] 1.5587083 1.71506499 -0.6868529 0.3598138 -0.5558411 -1.9666172 ## [,7] ## [1,] 0.7013559 ## [2,] -0.4727914 ## [3,] -1.0678237
apply(rand.mat, MARGIN = 1, FUN = max)
## [1] 1.786913 1.224082 1.715065
apply(rand.mat, MARGIN = 2, FUN = max)
## [1] 1.5587083 1.7150650 0.4609162 1.2240818 0.4007715 1.7869131 0.7013559
apply()rand.mat
## [,1] [,2] [,3] [,4] [,5] [,6] ## [1,] -0.5604756 0.07050839 0.4609162 -0.4456620 0.4007715 1.7869131 ## [2,] -0.2301775 0.12928774 -1.2650612 1.2240818 0.1106827 0.4978505 ## [3,] 1.5587083 1.71506499 -0.6868529 0.3598138 -0.5558411 -1.9666172 ## [,7] ## [1,] 0.7013559 ## [2,] -0.4727914 ## [3,] -1.0678237
apply(rand.mat, MARGIN = 1, FUN = sum)
## [1] 2.414327470 -0.006127405 -0.643547721
apply(rand.mat, MARGIN = 2, FUN = var)
## [1] 1.3000250 0.8704518 0.7717828 0.6972991 0.2405855 3.6373788 0.8104470
sapply()sapply() is used on list-objects and returns a matrix
my.list <- list(A = c(4, 2, 1:3), B = "Hello.", C = TRUE) sapply(my.list, class)
## A B C ## "numeric" "character" "logical"
sapply(my.list, range)
## A B C ## [1,] "1" "Hello." "1" ## [2,] "4" "Hello." "1"
It returns a vector or a matrix, depending on the output of the function.
Why is each element a character string?
sapply()Any data.frame is also a list, where each column is one list-element.
class(cats)
## [1] "data.frame"
is.list(cats)
## [1] TRUE
This means we can use sapply on data frames as well, which is often useful.
sapply(cats, class)
## Sex Bwt Hwt ## "factor" "numeric" "numeric"
lapply()lapply() is exactly the same as sapply(), but it returns a list instead of a vector.
lapply(cats, class)
## $Sex ## [1] "factor" ## ## $Bwt ## [1] "numeric" ## ## $Hwt ## [1] "numeric"
Functions are reusable pieces of code that take an input, do some computation on the input, and return output.
We have been using a lot of functions: code of the form something() is usually a function.
mean(1:6)
## [1] 3.5
The apply class of functions is very flexible and lightning fast, when compared to manual operations that could easily be defined in terms of functions.
The only caveat is that you need a function to apply. Many such functions are already available in R, such as mean(), mode(), sum(), cor(), and so on.
However, if you need to perform more than a simple calculation, it is often necessary to create your own function. In R functions take the following form
myfunction <- function(arguments){
hereyourfunctioncode
}
mean.sd <- function(argument1, argument2){
mean1 <- mean(argument1)
mean2 <- mean(argument2)
sd1 <- sd(argument1)
sd2 <- sd(argument2)
result <- data.frame(mean = c(mean1, mean2),
sd = c(sd1, sd2),
row.names = c("first", "second"))
return(result)
}
The above function calculates the means and standard deviations for two sources of input, then combines these statistics in a simple data frame and returns the data frame.
The sources of input are defined in the function arguments argument1 and argument2.
The reason why we have to specify function arguments is simple:
\[\text{EVERYTHING THAT HAPPENS IN A FUNCTION COMES FROM THE}\] \[\text{FUNCTION AND STAYS IN THE FUNCTION!}\]
This is because a function opens a seperate environment that only exists for as long as the function operates. This means:
To get information from the global environment to the function’s environment, we need arguments. To properly return information to the global environment, we should use return(). In general, using return() makes it explicit what your function’s return is. For complicated functions this is proper coding procedure, but for simple functions it is not strictly necessary.
To put this example function to the test:
mean.sd(argument1 = 1:10,
argument2 = 3:8)
## mean sd ## first 5.5 3.027650 ## second 5.5 1.870829
or, simply:
mean.sd(1:10, 3:8)
## mean sd ## first 5.5 3.027650 ## second 5.5 1.870829
boys <-
read_sav("boys.sav") %>%
head()
It effectively replaces head(read_sav("boys.sav")).
Let’s assume that we want to load data, change a variable, filter cases and select columns. Without a pipe, this would look like
boys <- read_sav("boys.sav")
boys2 <- transform(boys, hgt = hgt / 100)
boys3 <- filter(boys2, age > 15)
boys4 <- subset(boys3, select = c(hgt, wgt, bmi))
With the pipe:
boys <-
read_sav("boys.sav") %>%
transform(hgt = hgt/100) %>%
filter(age > 15) %>%
subset(select = c(hgt, wgt, bmi))
Benefit: a single object in memory that is easy to interpret
Your code becomes more readable:
f(x) becomes x %>% f()rnorm(10) %>% mean()
## [1] -0.2751301
f(x, y) becomes x %>% f(y)boys %>% cor(use = "pairwise.complete.obs")
## hgt wgt bmi ## hgt 1.0000000 0.6100784 0.1758781 ## wgt 0.6100784 1.0000000 0.8841304 ## bmi 0.1758781 0.8841304 1.0000000
h(g(f(x))) becomes x %>% f %>% g %>% hboys %>% subset(select = wgt) %>% na.omit() %>% max()
## [1] 117.4
%>% pipe%$% pipe. in a pipeIn a %>% b(arg1, arg2, arg3), a will become arg1. With . we can change this.
cats %>% plot(Hwt ~ Bwt, data = .)
VS
cats %$% plot(Hwt ~ Bwt)
The
. can be used as a placeholder in the pipe.
cats %$% t.test(Hwt ~ Sex)
## ## Welch Two Sample t-test ## ## data: Hwt by Sex ## t = -6.5179, df = 140.61, p-value = 1.186e-09 ## alternative hypothesis: true difference in means is not equal to 0 ## 95 percent confidence interval: ## -2.763753 -1.477352 ## sample estimates: ## mean in group F mean in group M ## 9.202128 11.322680
is the same as
t.test(Hwt ~ Sex, data = cats)
ggplot2anscombe dataanscombe
## x1 x2 x3 x4 y1 y2 y3 y4 ## 1 10 10 10 8 8.04 9.14 7.46 6.58 ## 2 8 8 8 8 6.95 8.14 6.77 5.76 ## 3 13 13 13 8 7.58 8.74 12.74 7.71 ## 4 9 9 9 8 8.81 8.77 7.11 8.84 ## 5 11 11 11 8 8.33 9.26 7.81 8.47 ## 6 14 14 14 8 9.96 8.10 8.84 7.04 ## 7 6 6 6 8 7.24 6.13 6.08 5.25 ## 8 4 4 4 19 4.26 3.10 5.39 12.50 ## 9 12 12 12 8 10.84 9.13 8.15 5.56 ## 10 7 7 7 8 4.82 7.26 6.42 7.91 ## 11 5 5 5 8 5.68 4.74 5.73 6.89
anscombe %>% ggplot(aes(y1, x1)) + geom_point() + geom_smooth(method = "lm")
Source: Anscombe, F. J. (1973). “Graphs in Statistical Analysis”. American Statistician. 27 (1): 17–21.
ggplot2?Layered plotting based on the book The Grammer of Graphics by Leland Wilkinsons.
With ggplot2 you
ggplot2 then takes care of the details
1: Provide the data
mice::boys %>% ggplot()
2: map variable to aesthetics
mice::boys %>% ggplot(aes(x = age, y = bmi))
3: state which geometric object to display
mice::boys %>% ggplot(aes(x = age, y = bmi)) + geom_point()
Create the plot
gg <- mice::boys %>% ggplot(aes(x = age, y = bmi)) + geom_point(col = "dark green")
Add another layer (smooth fit line)
gg <- gg + geom_smooth(col = "dark blue")
Give it some labels and a nice look
gg <- gg + labs(x = "Age", y = "BMI", title = "BMI trend for boys") + theme_minimal()
plot(gg)